Author: Adam Bukowski
Project: Churn prediction for bank customers
# core libraries for data analysis and visualisation
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# libraries for modelling
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, RepeatedStratifiedKFold, ParameterGrid
import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn import metrics
import shap # to look inside the black box
# additional
from IPython.display import HTML # for code hide / show button
import pickle # saving and reading datasets / models
import types # for modifying existing object's method
from copy import deepcopy # to copy the model object, alter its method and not change the original
import warnings # for ignoring the warnings
# global settings
sns.set_context("talk")
warnings.filterwarnings("ignore", category = UserWarning)
print("list of used libraries along with their versions:")
# note: 'find' is the Windows filter command; on Linux / macOS replace it with 'grep'
!pip3 list | find "numpy "
!pip3 list | find "pandas "
!pip3 list | find "seaborn"
!pip3 list | find "matplotlib"
!pip3 list | find "scikit-learn"
!pip3 list | find "xgboost"
!pip3 list | find "shap "
print("pickle ", pickle.format_version)
# Button to hide / show code in entire notebook
# always visible in the bottom right corner
HTML('''<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<button onclick="javascript:code_toggle()"
style = "background-color: #4CAF50; /* Green */
border: none;
color: white;
padding: 15px 32px;
text-align: center;
text-decoration: none;
display: inline-block;
font-size: 16px;
position: fixed;
bottom: 0px;
right: 20px;">
Hide / Show code
</button>
''')
The aim of the project is to build a good-quality model predicting customer churn from the bank. This is a supervised classification problem.
During the project we will explore the data, engineer and select features, train several types of models with hyperparameter tuning, and evaluate them.
We will use the dataset publicly available on the Kaggle website: https://www.kaggle.com/mathchi/churn-for-bank-customers
# reading the downloaded file
df = pd.read_csv('./Churn_Modelling.csv')
print("There are {} observations in the dataset".format(df.shape[0]))
print("There are {} features in the dataset".format(df.shape[1]))
print("Here are the first 5 rows of the data (transposed):")
df.head().T
print("We can see 3 text features:")
df.describe(include = "object").T
print("And 11 numeric features:")
# We will suppress scientific notation for a clearer picture
df.describe().apply(lambda s: s.apply('{0:.2f}'.format)).T
It is worth pointing out right away that we have no missing values in our data (the count is 10000 for each variable in both summary tables, and the number of observations in our dataset is 10000). This means one less step in our analysis, as typically we would have to treat missing values somehow before modelling (usually by mean / median / most frequent category imputation).
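For illustration, had the dataset contained gaps, imputation could be done as in the minimal sketch below (using scikit-learn's SimpleImputer on a hypothetical toy frame, not on our actual data):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# hypothetical toy data with gaps (our actual dataset has none)
toy = pd.DataFrame({"CreditScore": [650, np.nan, 720], "Age": [40, 35, np.nan]})

# median imputation: each NaN is replaced by the column median of the non-missing values
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled.isna().sum().sum())  # 0 - no missing values remain
```

In a real pipeline the imputer would be fitted on the train set only, so that test-set statistics do not leak into preprocessing.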
Some of these numeric variables should rather be considered as categorical variables which we can observe by looking at the number of unique values for each feature:
df.apply(lambda s: s.nunique()).to_frame("unique")
For 10000 observations we see 10000 different CustomerId so we should expect 10000 distinct customers being described by the data. On the other hand, we see 2932 unique surnames which in theory could mean that one customer is described by more than one row (although of course surname is not a unique identifier of a person).
To be certain that each row describes a different customer, we took a subset of features (other than CustomerId) and show below that each unique combination of these features occurs only once in our dataset:
df.groupby(['Surname','Age','Tenure','Geography','CreditScore']).size().value_counts().to_frame("#")
After first independent verification of the features and their types and values we can prepare a brief description of the features (also utilising information about the data from Kaggle website):
After some initial EDA in the previous section we already know quite a lot about the feature types and have some idea about their distributions. Here we will look for insights into the potential usefulness of these features for churn modelling.
First of all, we will drop CustomerId and Surname, which, as we already justified, will not be useful. We will keep RowNumber as a unique identifier of observations for potential future joins, but we will neither analyse it nor take it into account during modelling.
# dropping not needed columns
df.drop(['CustomerId', 'Surname'], axis = 1, inplace = True)
Then we will look at our target variable, Exited. As we can see in the figure below, slightly over 20% of the customers in our dataset are churners. This means we have a class imbalance and will have to be careful during modelling. For example, without any effort we can have a model with almost 80% accuracy just by always predicting the "no churn" class. Of course, such a model is not useful at all. We will have to focus on the positive class, which is especially important when modelling churn, considering that it is usually easier / cheaper to prevent a customer from leaving the bank than to acquire a new one. We will discuss this further when deciding how to measure the performance of our models.
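The accuracy trap mentioned above is easy to demonstrate on synthetic labels with the same ~80/20 split (a hedged illustration, not our actual data):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# synthetic target mirroring our imbalance: 8000 non-churners, 2000 churners
y = np.array([0] * 8000 + [1] * 2000)

# a "model" that always predicts the majority class
y_pred = np.zeros_like(y)

# accuracy looks decent, yet this model catches no churners at all
acc = accuracy_score(y, y_pred)
print(acc)  # 0.8
```

This is why we will rely on metrics that account for the positive class rather than raw accuracy.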
plt.figure(figsize=(10,6))
# calculating % of each class for labels over bars
labels = df.groupby('Exited').size().to_frame("cnt").assign(pr = lambda x: x.cnt / sum(x.cnt))
f = sns.countplot(x = 'Exited', data = df)
# adding labels over bars
for i, p in enumerate(f.patches):
f.annotate('{:.0f} ({:.2%})'.format(p.get_height(), labels.pr[i]), (p.get_x()+0.2, p.get_height()+100))
f.set(ylim=(0, 9000))
plt.title('Distribution of Exited variable')
plt.show()
Then we will analyse variables that can be considered categorical.
def EDA_categorical(data, variable, target):
'''Function producing two side by side plots for categorical variable:
- countplot to show distribution between all categories
- share of positive values of target variable for each category
Args:
data, dataframe: data with at least two columns - specified variable and target 0/1 column
variable, string: name of the categorical variable we want to analyse
target, string: name of the 0/1 target variable
Returns:
Nothing
'''
fig, (ax1, ax2) = plt.subplots(nrows = 1, ncols = 2, figsize = (15, 3))
sns.countplot(x = variable, data = data, ax = ax1, palette = 'Blues')
sns.barplot(x = variable, y = target, data = data, ax = ax2, ci = None, palette = 'Blues')
# adding labels for bars
for p in zip(ax1.patches,ax2.patches):
ax1.annotate('{:.0f}'.format(p[0].get_height()), (p[0].get_x(), 100))
ax2.annotate('{:.1%}'.format(p[1].get_height()), (p[1].get_x(), 0.02))
plt.suptitle(variable + ": distribution & what share of each category exited")
plt.show()
categorical = ['Geography', 'Gender', 'NumOfProducts', 'HasCrCard','IsActiveMember']
for v in categorical:
EDA_categorical(df, v, 'Exited')
From the set of plots above we can draw the following conclusions:
def bucket_continuous(data, v, no_buckets):
'''Function adding to data an additional variable - buckets of the supplied
continuous variable. Buckets are generated based on quantiles. The name of the new variable is the
original name with the prefix 'CAT__'. Note that if we divide a given variable into 4
buckets and one value of that variable is observed in > 25% of the whole dataset, we do not raise an
error. Instead we drop the duplicated bucket boundary (thus in fact merging 2 buckets)
Args:
data, dataframe: data with column v
v, string: name of the presumably continuous column existing in data.
no_buckets, integer: number of buckets we want to achieve after grouping variable v based on quantiles
Returns:
dataframe: data with column name 'CAT__{v}' with enumerated buckets calculated based on variable v
'''
return data.assign(**{"CAT__"+v: lambda x: pd.qcut(x[v], no_buckets, labels=False, duplicates = 'drop')})
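The duplicates = 'drop' behaviour described in the docstring can be seen on a hypothetical series where one value covers more than a quartile of the observations:

```python
import pandas as pd

# 60% of the values are 0, so the 25% and 50% quantile edges both fall at 0
s = pd.Series([0] * 60 + list(range(1, 41)))

# requesting 4 buckets: the duplicated edges are dropped, so fewer (merged) buckets come back
buckets = pd.qcut(s, 4, labels=False, duplicates='drop')
print(buckets.nunique())
```

Without duplicates = 'drop', pd.qcut would raise a ValueError on such a series because the bin edges are not unique.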
def EDA_continuous(data, variable, target):
'''Function producing two side by side plots for continuous variable:
- ECDF to show distribution - separately for both values of target variable
- share of positive values of target variable for each of the automatically generated buckets of supplied
continuous variable. Buckets are generated based on quartiles
Args:
data, dataframe: data with at least two columns - specified variable and target 0/1 column
variable, string: name of the continuous variable we want to analyse
target, string: name of the 0/1 target variable
Returns:
Nothing
'''
fig, (ax1, ax2) = plt.subplots(nrows = 1, ncols = 2, figsize = (15, 3))
sns.ecdfplot(x = variable, data = data, ax = ax1, hue = target)
sns.barplot(x = variable+" (~quartiles)", y = target,
data = bucket_continuous(data, variable, 4). \
rename({"CAT__"+variable: variable+" (~quartiles)"}, axis = 1),
ax = ax2, ci = None, palette = 'Blues')
# adding labels for bars
for p in ax2.patches:
ax2.annotate('{:.1%}'.format(p.get_height()), (p.get_x(), 0.02))
plt.suptitle(variable + ": distribution (ECDF) & what share of each bucket exited")
plt.show()
continuous = ['CreditScore', 'Age', 'Tenure', 'Balance','EstimatedSalary']
for v in continuous:
EDA_continuous(df,v,'Exited')
print("Additional check for Balance 0 vs >0: ")
df.assign(Balance = lambda x: ['Balance = 0' if b == 0 else 'Balance > 0' for b in x.Balance]). \
groupby('Balance'). \
agg(cnt = ('Balance','size'), exited = ('Exited', sum)). \
assign(**{"% exited" : lambda x: ['{:.2%}'.format(p) for p in x.exited / x.cnt]})
For each continuous variable we presented two plots:
From these plots we draw the following conclusions:
# setting the seed for reproducibility
np.random.seed(3103)
plt.figure(figsize=(10,6))
sns.ecdfplot(x = 'EstimatedSalary', data = df)
sns.ecdfplot(x = 'random from 0 to 200k',
# adding 10000 values randomly generated from uniform distribution between 0 and 200000 as new variable
data = df.assign(**{'random from 0 to 200k': lambda x: np.random.rand(10000,1)*200000}))
plt.legend(labels = ["EstimatedSalary", "random from 0 to 200k"])
plt.title("ECDF comparison\n EstimatedSalary vs random from uniform distribution")
plt.show()
In fact, we can also see regular behaviour of the ECDF for the Tenure variable, which, along with the fact that it basically does not differentiate churners from non-churners, raises some suspicions regarding the legitimacy of this variable. Below we look at the distribution of Tenure; in our opinion it is not unlikely that it was generated from a discrete distribution where 0 and 10 have about 5% probability each and the values from 1 to 9 have about 10% probability each. While it would be rather rare to draw only 413 instances of the value 0 in such a case, our simulation gave quite similar results.
def summarise_discrete_dist(data, variable):
'''Function summarising data by presumably discrete variable and presenting useful distribution information
in the context of this churn analysis
Args:
data, dataframe: data with at least two columns - specified variable and Exited numeric column
variable, string: name of the variable by which we want to group
Returns:
dataframe: counts, number and % of exited and % of total for each value of supplied variable
'''
return data.groupby(variable). \
agg(cnt = (variable,'size'), exited = ('Exited', sum)). \
assign(**{"% of total" : lambda x: ['{:.1%}'.format(p) for p in x.cnt / sum(x.cnt)],
"% of bucket exited": lambda x: ['{:.1%}'.format(p) for p in x.exited / x.cnt]}).T
display(summarise_discrete_dist(df,'Tenure'))
# setting the seed for reproducibility
np.random.seed(3103)
# possible discrete values - from 0 to 10
elements = list(range(11))
# probabilities for each value - 5% for 0 and 10, 10% for 1 to 9
probabilities = [0.05] + list(np.ones(9)*0.1) + [0.05]
display(summarise_discrete_dist(
# adding 10000 values from discrete distribution with parameters defined above
df.assign(**{'Tenure (simulated)': lambda x: np.random.choice(elements, 10000, p=probabilities)}),
'Tenure (simulated)'))
It is worth noting that, after exploring the variables, all their values seem reasonable (given the variables' meanings - for example, no 200-year-old customers). We also did not spot any values that should be considered outliers, therefore we will not perform any actions to treat potential outliers (which is often one of the data preprocessing steps).
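For completeness, a common form of such a treatment step is the IQR rule; a minimal sketch on a hypothetical series (we do not apply it to our data):

```python
import pandas as pd

# hypothetical numeric series with one extreme value
s = pd.Series([50, 52, 55, 53, 51, 54, 200])

# classic IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers.tolist())  # [200]
```

Flagged values are then typically capped (winsorised), removed, or left to a robust model, depending on the use case.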
One last thing to notice from our EDA is that the data about customers seems to represent values from one point in time, while the information whether a customer exited comes from some later point in time. Otherwise we would rather not see customers who exited the bank but are still active or have a credit card. Of course, data prepared in this way is reasonable, as we want to predict future churn in order to prevent it.
Typically, feature engineering is a vital part of any data science modelling project. In our case we are heavily limited by the small number of original features available in the dataset. For churn modelling, the most useful insights usually come from observing changes in behaviour over time, but this would require several observations from different points in time for each customer (which we do not have).
We will construct some ratio-type variables and one-hot encode our categorical variables.
def add_new_features(data):
'''Function adding new features designed for the dataset analysed in this project
Args:
data, dataframe: dataset analysed in this project
Returns:
dataframe: dataframe extended by new features
'''
return data.assign(
# ratios
NumOfProductsByAge = lambda x: x.NumOfProducts / x.Age,
CreditScoreByAge = lambda x: x.CreditScore / x.Age,
BalanceByAge = lambda x: x.Balance / x.Age,
BalanceBySalary = lambda x: [250000 if b < 1 else a/b for a,b in zip(x.Balance, x.EstimatedSalary)],
BalanceByTenure = lambda x: [a if b < 1 else a/b for a,b in zip(x.Balance, x.Tenure)],
BalanceByCreditScore = lambda x: x.Balance / x.CreditScore,
TenureByAge = lambda x: x.Tenure / x.Age,
SalaryByAge = lambda x: x.EstimatedSalary / x.Age,
SalaryByCreditScore = lambda x: x.EstimatedSalary / x.CreditScore,
# one-hot encoding
FromGermany = lambda x: [1 if g == 'Germany' else 0 for g in x.Geography],
FromSpain = lambda x: [1 if g == 'Spain' else 0 for g in x.Geography],
IsFemale = lambda x: [1 if g == 'Female' else 0 for g in x.Gender],
NumOfProductsTwo = lambda x: [1 if n == 2 else 0 for n in x.NumOfProducts],
NumOfProductsMany = lambda x: [1 if n > 2 else 0 for n in x.NumOfProducts]
)
df_fe = add_new_features(df)
print("Number of columns before feature engineering: ", df.shape[1])
print("Number of columns after: ", df_fe.shape[1])
Let us first do some quick EDA of our new ratios
new_continuous = df_fe.loc[:,"NumOfProductsByAge":"SalaryByCreditScore"].columns.to_list()
for v in new_continuous:
EDA_continuous(df_fe,v,'Exited')
As can be seen, some of these ratios are potentially impactful for our modelling task and some are not. We will not focus now on the interpretation of these features (especially since the behaviour of some of the original features is not entirely intuitive, as we mentioned in the EDA section), as some of them will probably be dropped during the feature selection part of the project.
Let us also inspect the 1/0 variables we got from One-Hot encoding of categorical variables:
one_hot_encoded = df_fe.loc[:,"FromGermany":"NumOfProductsMany"].columns.to_list()
for v in one_hot_encoded:
EDA_categorical(df_fe,v,'Exited')
It can be seen that all of them are potentially useful. Some justification is worth making here:
Feature selection is also an important step in any modelling task. By excluding features that have very little predictive value we can significantly reduce the computation time in the modelling phase (which is especially important if we want to tune the model's hyperparameters) and usually obtain better models.
There are many techniques to select the features. Several examples would be:
In this project we will show the last one. But first of all we will show how to calculate IV:
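As a hedged numeric illustration of the WoE formula implemented below (hypothetical bucket counts, not taken from our data): a bucket holding 300 of the 800 negatives but 100 of the 200 positives gets a negative WoE, because the bucket's positive rate (25%) exceeds the overall one (20%):

```python
import numpy as np

# hypothetical bucket: Nb = 300 negatives, Pb = 100 positives
# whole dataset: N = 800 negatives, P = 200 positives; c = 0.5 is the usual correction
Nb, Pb, N, P, c = 300, 100, 800, 200, 0.5

# WoE = ln(share of all negatives in the bucket / share of all positives in the bucket), with correction c
woe_value = np.log((Nb + c) * (P + c) / ((N + c) * (Pb + c)))
print(round(woe_value, 3))  # -0.289
```

The IV contribution of that bucket is then (%N - %P) * WoE, summed over all buckets of a variable.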
# Definition of functions for WoE and IV calculation and summary
def woe(Nb,Pb,N,P,c = 0.5):
'''Function calculating WoE values based on supplied parameters
Args:
Nb, integer or array / series of integers: number of observations from negative class in bucket
Pb, integer or array / series of integers: number of observations from positive class in bucket
N, integer or array / series of integers: total number of observations from negative class in whole dataset
P, integer or array / series of integers: total number of observations from positive class in whole dataset
c, float: correction parameter, typically 0.5
Returns:
float or array / series of floats: calculated WoE value
'''
return np.log((Nb+c)*(P+c)/((N+c)*(Pb+c)))
def woe_transform(data,target,v):
'''Function adding to data additional variable - calculated WoE values for groups defined by supplied
variable v.
Args:
data, dataframe: data with column target and v
target, string: name of the 0/1 target variable
v, string: name of the column existing in data WoE will be calculated for buckets defined by each
distinct value of v so v is expected to be categorical with low number of categories.
Returns:
dataframe: data with column name 'WOE__{v}' with WoE values calculated for buckets defined by v
'''
# calculating total number of observations for each class
N = data.query(target + " == 0").shape[0]
P = data.query(target + " == 1").shape[0]
# typical correction parameter for WoE
c = 0.5
return data.assign(**{"WOE__"+v: lambda x: [woe(Tb-Pb,Pb,N,P,c) for Pb, Tb in
zip(x.groupby(v)[target].transform('sum'),
x.groupby(v)[target].transform('count')
)
]
}
)
def calculate_woe(data, target, v_categorical = None, v_continuous = None, no_buckets = 4):
'''Function utilising function woe_transform to calculate WoE variables for supplied variables
Args:
data, dataframe: data with column target and columns specified in lists v_categorical and v_continuous
target, string: name of the 0/1 target variable
v_categorical, list of strings: names of the categorical columns for which we want to calculate corresponding
WoE variables. Each distinct value of given column will define separate bucket
v_continuous, list of strings: names of the continuous columns. First for each column the corresponding
bucketed version will be calculated using function bucket_continuous. These new columns will define buckets
on which corresponding WoE variables will be calculated
no_buckets, integer: number of buckets we want to achieve after grouping each of the continuous variables
Returns:
dataframe: data with columns 'CAT__{v}' for each variable v from v_continuous representing their bucketed
versions and columns 'WOE__{v}' for each of these bucketed variables and for each variable from list
v_categorical
'''
result = data.copy()
if v_categorical is not None:
for v in v_categorical:
result = woe_transform(result,target,v)
if v_continuous is not None:
for v in v_continuous:
result = bucket_continuous(result, v, no_buckets)
result = woe_transform(result, target, 'CAT__'+v)
return result
def woe_summary(data, target):
'''Function to calculate woe summary for whole dataset
Args:
data, dataframe: data with column target
target, string: name of the 0/1 target variable
Returns:
df_summary, dataframe: summary of WoE calculation. It is created for variables with name starting with "WOE__"
(therefore it will work well with previously developed functions). "P" means positive class of target variable
(where it equals 1) and "N" means negative class of target variable (where it equals 0)
'''
# find all WoE_variables and corresponding original variables
WOE_variables = [col for col in data if col.startswith('WOE__')]
bucket_variables = [col.replace('WOE__','',1) for col in WOE_variables]
original_variables = [col.replace('CAT__','',1) for col in bucket_variables]
# initialize empty dataframe
df_summary = pd.DataFrame()
for v, b, w in zip(original_variables, bucket_variables, WOE_variables):
df_buckety = data.assign(
variable = v,
bucket = lambda x: x[b],
woe = lambda x: x[w],
N = lambda x: 1-x[target]
).groupby(["variable", "bucket", "woe"], as_index = False).agg(
min = (v, min),
max = (v, max),
total = ("bucket", 'count'),
P = (target, sum),
N = ("N", sum)
).assign(**{
"% total": lambda x: ['{:.2%}'.format(y) for y in x.total / sum(x.total)],
"% all P": lambda x: x.P / sum(x.P),
"% all N": lambda x: x.N / sum(x.N),
"P rate": lambda x: ['{:.2%}'.format(y) for y in x.P / x.total],
"IV": lambda x: (x["% all N"] - x["% all P"])*x["woe"],
}).assign(**{
"% all P": lambda x: ['{:.2%}'.format(y) for y in x["% all P"]],
"% all N": lambda x: ['{:.2%}'.format(y) for y in x["% all N"]]
})
df_summary = pd.concat([df_summary, df_buckety])
return df_summary
def print_woe_details(data_summary):
'''Function printing separate tabular summary for each variable based on summary prepared by function woe_summary
Args:
data_summary, dataframe: result of function woe_summary
Returns:
Nothing
'''
# get list of all variables included in the woe summary
v_list = data_summary.variable.unique()
for v in v_list:
print(v, ":")
df_buckety = data_summary[data_summary.variable == v]
# add one row for each variable with aggregates for whole dataset
total = pd.DataFrame({
"variable": df_buckety['variable'].max(),
"bucket": "TOTAL",
"woe": "",
"min": df_buckety['min'].min(),
"max": df_buckety['max'].max(),
"total": df_buckety['total'].sum(),
"P": df_buckety['P'].sum(),
"N": df_buckety['N'].sum(),
"% total": "100.00%",
"% all P": "100.00%",
"% all N": "100.00%",
"P rate": '{:.2%}'.format(df_buckety['P'].sum() / df_buckety['total'].sum()),
"IV": df_buckety.IV.sum()
}, index = [0])
display(pd.concat([df_buckety, total]).reset_index(drop = True))
def iv_barplot(data_summary, title = ""):
'''Function printing barplot with IV values for each variable summarised in data_summary
Args:
data_summary, dataframe: result of function woe_summary
title, string: custom title of the plot
Returns:
Nothing
'''
# calculate Information Value of variable and apply groups based on one of typical benchmarks
plot_data = data_summary.groupby('variable', as_index = False)["IV"].sum().assign(
benchmark = lambda x: pd.cut(x.IV, [0,0.02,0.1,0.3,0.5,np.inf],
labels = ['not useful','weak','moderate','strong', 'susp. strong'] )
).sort_values('IV', ascending = False)
fig, ax = plt.subplots(figsize=(13,9))
sns.barplot(y = 'variable', x = 'IV', hue = 'benchmark', data = plot_data, palette = 'RdYlGn', dodge=False, ax = ax)
# adding IV values next to the bars
for index, value in enumerate(plot_data.IV):
ax.annotate('{:.2}'.format(value), xy=(value+0.01, index+0.3))
ax.set_title(title)
ax.set(xlim=(0, plot_data.IV.max()+0.15))
plt.show()
The figure below shows the results of our Information Value calculation:
df_fe_woe = calculate_woe(df_fe,'Exited',
# original treated as categorical + One-Hot encodings of some of them
v_categorical = categorical + one_hot_encoded,
# original treated as continuous + engineered ratios
v_continuous = continuous + new_continuous)
df_woe_summary = woe_summary(df_fe_woe, 'Exited')
iv_barplot(df_woe_summary, "Information Value for all analysed features")
Based on the obtained IV values:
Below we additionally present the details of the IV and WoE calculation (it can also be seen how the continuous features were divided into buckets - these are the same buckets as in our EDA):
print_woe_details(df_woe_summary)
Below we listed the variables we keep after verifying Information Values:
predictive_features = [x for x in df_woe_summary.variable.unique() if x not in
['HasCrCard','EstimatedSalary','Tenure','CreditScore','SalaryByCreditScore','FromSpain',
'Geography','Gender','NumOfProducts']
]
predictive_features
The next step is verification of correlations between these features. For some of the modelling techniques we will apply (for example logistic regression), it is important not to have too highly correlated features, as it may limit the model's performance. We will look both at Pearson correlations (capturing linear association) and Spearman correlations (capturing any monotonic association). The threshold above which (in terms of absolute value) we consider a correlation strong is usually selected from the range 0.7-0.9. We will choose a threshold of 0.8.
def correlation_plot(data, features = None, method = 'pearson', title = ""):
'''Function printing correlation heatmap based on correlations of each possible pair of features
Args:
data, dataframe: dataset with features between which we want assess the correlations
features, list of strings: list of features names on which we want to perform correlation analysis.
By default None - all columns from data will be used then
method, string: argument passed to corr method of dataframe. Should be 'pearson', 'spearman' or 'kendall'
title, string: custom title of the plot
Returns:
Nothing
'''
if features is not None:
data_to_cor = data[features]
else:
data_to_cor = data
corr = data_to_cor.corr(method = method)
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))
# Add the mask to the heatmap
fig, ax = plt.subplots(figsize=(13,7))
sns.heatmap(corr, mask=mask, center=0, linewidths=1, annot=True, fmt=".2f", annot_kws={"size":13})
ax.set_title(title)
plt.show()
correlation_plot(df_fe, predictive_features, title = "Pearson's correlations of predictive features")
correlation_plot(df_fe, predictive_features, method = 'spearman', title = "Spearman's correlations of predictive features")
Based on the results above and selected threshold of 0.8 we will exclude the following features:
Our final set of variables for modelling is therefore as follows:
selected_features = [x for x in predictive_features if x not in
['BalanceByAge', 'BalanceBySalary', 'BalanceByTenure', 'BalanceByCreditScore', 'CreditScoreByAge']
]
selected_features
We will take one final look at the selected features' Information Values and correlations to make sure that we did not miss anything:
iv_barplot(df_woe_summary[df_woe_summary.variable.isin(selected_features)], "Information Value for selected features")
correlation_plot(df_fe, selected_features, title = "Pearson's correlations of selected features")
correlation_plot(df_fe, selected_features, method = 'spearman', title = "Spearman's correlations of selected features")
Before starting the modelling we will split our data into train and test sets (with a 70:30 proportion, which is a common choice). We will train models (and perform hyperparameter tuning using cross-validation) on the train set and verify the final candidates on the test set. We will stratify the split by the target variable to guarantee that the proportion of customers who exited is the same in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(df_fe[selected_features], df_fe.Exited, test_size = 0.3,
stratify = df_fe.Exited, random_state = 3103)
print("train set: ")
display(y_train.value_counts().to_frame("cnt").assign(pr = lambda x: ['{:.2%}'.format(p) for p in x.cnt / sum(x.cnt)]))
print("test set: ")
display(y_test.value_counts().to_frame("cnt").assign(pr = lambda x: ['{:.2%}'.format(p) for p in x.cnt / sum(x.cnt)]))
# save and load python data structures to / from pickle files
def save_data(data, filename):
'''Saves almost any data structure into <filename>.pkl in working directory
Args:
data, any type: any data structure to save
filename, string: name of the file in which data will be saved
Returns:
Nothing
'''
# context manager closes the file even if dump raises
with open('./'+filename+'.pkl', 'wb') as pkl_file:
pickle.dump(data, pkl_file)
def load_data(filename):
'''Loads contents of the pkl file in working directory
Args:
filename, string: name of the file in which data will be loaded
Returns:
data, any type: any data structure saved in the chosen pkl file
'''
# context manager closes the file even if load raises
with open('./'+filename+'.pkl', 'rb') as pkl_file:
data = pickle.load(pkl_file)
return data
# saving data preparation and split for further use
save_data([X_train, X_test, y_train, y_test], "train_test_split_list")
As we mentioned in the EDA, we have a case of class imbalance in our data - there are about 4 times more non-churners than churners. There are several ways of dealing with class imbalance. We can for example:
As the class imbalance is not that extreme in our case (unlike, for example, fraud detection, where you can often see 1% frauds and 99% non-frauds), we will not use any of these methods. We will, however, not use the default classes predicted by our models. We will be interested only in the probabilities produced by these models, and we will analyse what threshold would be best in our case. Additionally, as part of our hyperparameter tuning, we will test whether it is beneficial to increase the weights of observations from the less frequent class.
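Picking a probability threshold instead of the default 0.5 can be sketched as follows (a minimal illustration on synthetic data, assuming a simple scan over candidate thresholds maximising F1):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# synthetic imbalanced data (~20% positives), roughly mirroring our class proportions
X, y = make_classification(n_samples=2000, weights=[0.8], random_state=3103)

model = LogisticRegression().fit(X, y)
proba = model.predict_proba(X)[:, 1]  # predicted probabilities, not hard classes

# scan candidate thresholds and keep the one maximising F1 on the positive class
thresholds = np.arange(0.05, 0.95, 0.05)
scores = [f1_score(y, proba >= t) for t in thresholds]
best_t = thresholds[int(np.argmax(scores))]
print(best_t)
```

In practice the threshold should be chosen on a validation set (or via cross-validation), not on the data the model was trained on as in this toy sketch.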
We will use AUC (area under the ROC curve) as the scoring metric during hyperparameter tuning. We do not want to use accuracy, because it is not a recommended metric in the case of class imbalance. For each model type we will define a grid of hyperparameters to test, and each set of parameters will be evaluated by cross-validation on the training set. The parameters with the highest average AUC (across the folds playing the role of test sets) will be selected for the best model of each type.
# loading the saved, prepared data for modelling and evaluation
[X_train, X_test, y_train, y_test] = load_data("train_test_split_list")
In the case of logistic regression it is advisable to scale the features first in order to make their magnitudes comparable. We will do it on the train set as part of our modelling pipeline; therefore, when we later use our model to predict on unseen data, the scaling will be applied based on the means and standard deviations from the train set. Other preprocessing steps can also be applied this way - for instance, if we had missing data (fortunately not the case here), we would be able to impute it directly in the modelling pipeline.
def hyperparameter_tuning(X, y, model, grid, folds, repeats, scale = False, n_jobs = None, n_iter = None):
'''Performs hyperparameter tuning for supplied model type. Uses stratified cross validation, which means
that each fold contains approximately the same percentage of samples of each target class as the complete set.
Args:
X, dataframe: dataframe with features
y, 0/1 series: series representing target variable
model, sklearn model class: model type we want to tune and fit
grid, dictionary: dictionary of parameters for model we want to test
folds, integer: number of folds we will use in cross validation
repeats, integer: number of times we want to rerun splitting to folds.
Thus for each set of hyperparameters we fit folds * repeats models
scale, bool: whether to scale the features prior to fitting
n_jobs, integer: Number of jobs to run in parallel. None means 1, -1 means all processors
n_iter, integer: if None, we will apply GridSearchCV. Otherwise - RandomizedSearchCV with n_iter tested sets
of hyperparameters
Returns:
tuner, GridSearchCV / RandomizedSearchCV object: SearchCV object fitted to supplied X and y
'''
# setup the pipeline steps. We want to use pipeline to potentially add the scaling of features as part of
# fitting the model to the training data
steps = []
if (scale == True):
steps.append(('scaler',StandardScaler()))
steps.append(('model',model))
# create the pipeline
pipeline = Pipeline(steps)
# correct the grid to include modelling step name
parameters = { ("model__"+k).replace(':', ''): v for k, v in grid.items() }
# define cross validation strategy
cv = RepeatedStratifiedKFold(n_splits=folds, n_repeats=repeats, random_state=3103)
# create GridSearchCV / RandomizedSearchCV object
if (n_iter is None):
tuner = GridSearchCV(pipeline, parameters,
cv = cv, verbose = 1,scoring='roc_auc', n_jobs = n_jobs)
else:
tuner = RandomizedSearchCV(pipeline, parameters,
cv = cv, verbose = 1,scoring='roc_auc', n_jobs = n_jobs, random_state = 3103, n_iter = n_iter)
# fit to the supplied X, y
tuner.fit(X,y)
return tuner
def show_best_params(tuner):
'''Reads best hyperparameters from GridSearchCV / RandomizedSearchCV fitted object.
Removes the "model__" prefix (added in the hyperparameter_tuning function as the name
of the modelling step in the pipeline)
Args:
tuner, GridSearchCV / RandomizedSearchCV object: SearchCV object already fitted to some data
Returns:
dictionary: dictionary with original names of hyperparameters
'''
return { k.replace('model__', ''): v for k, v in tuner.best_params_.items() }
# logistic regression hyperparameters we want to test
hyperparams = {
# we allow both L1 and L2 regularization
"penalty": ['elasticnet'],
# Inverse of regularization strength
"C": [0.001, 0.01, 0.05, 0.1, 0.5, 1, 10],
# l1_ratio = 0 => L2 regularization, l1_ratio=1 => L1 regularization.
# For 0 < l1_ratio <1, the penalty is a combination of L1 and L2.
"l1_ratio": [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1],
# The balanced mode uses the values of y to automatically adjust weights inversely proportional to class frequencies
"class_weight": [None,'balanced'],
# we need to choose this solver if we want to use elasticnet
"solver": ['saga'],
"random_state": [3103]
}
logistic_tuned = hyperparameter_tuning(X_train, y_train, LogisticRegression(), hyperparams,
folds = 5, repeats = 1, scale = True, n_jobs = 2)
save_data(logistic_tuned, "logistic_tuned")
The tuning process took about 25 seconds on our mediocre computer. The best hyperparameters from cross validation are as follows:
logistic_tuned = load_data("logistic_tuned")
show_best_params(logistic_tuned)
The figure below summarises the results obtained during tuning:
def plot_hyperparameters_impact(tuner, title):
'''Summary of the mean scores obtained during hyperparameter tuning. Best mean AUC is printed on the plot
Args:
tuner, GridSearchCV / RandomizedSearchCV object: fitted SearchCV object with AUC scoring
title, string: custom title of the plot
Returns:
Nothing
'''
y = tuner.cv_results_['mean_test_score']
std = tuner.cv_results_['std_test_score']
# approximate 90% confidence interval (normality assumption) based on the mean and std of observed AUC values
data = pd.DataFrame({"y": y,"std": std}).sort_values(['y','std'], ascending = [True,False]).assign(
x = list(range(1,len(y)+1)),
y_upper = lambda d: d['y']+1.67*d['std'],
y_lower = lambda d: d['y']-1.67*d['std'])
fig, ax = plt.subplots(figsize=(13,7))
sns.lineplot(y = "y", x = "x", data = data, ax = ax, color = 'black')
plt.fill_between(x = 'x', y1 = 'y_lower', y2 = 'y_upper', data = data, color = 'lightgrey')
ax.set_xlabel("hyperparameters set id")
ax.set_ylabel("mean test fold AUC")
ax.set_title(title)
ax.annotate('{:.4}'.format(data.y.max()),
xy=(data.x.max(), data.y.max()),
xytext =(0.9*data.x.max(),(data.y.min()+data.y.max())/2),
arrowprops = {'facecolor':'black', 'width': 2})
#plt.grid()
plt.show()
plot_hyperparameters_impact(logistic_tuned,
'Logistic regression tuning\n Mean test fold AUC (sorted) with 90% CI')
As in the case of logistic regression, we will scale the features as part of our modelling pipeline, since it is also beneficial for this method
# SVM hyperparameters we want to test
hyperparams = {
"kernel": ['rbf','sigmoid'],
# Inverse of regularization strength
"C": [0.001, 0.01, 0.05, 0.1, 0.5, 1, 10],
# kernel coefficient
'gamma': [0.00001, 0.0001, 0.001, 0.01, 0.1],
# The balanced mode uses the values of y to automatically adjust weights inversely proportional to class frequencies
"class_weight": [None,'balanced']
}
SVM_tuned = hyperparameter_tuning(X_train, y_train, SVC(), hyperparams,
folds = 5, repeats = 1, scale = True, n_jobs = 2)
save_data(SVM_tuned, "SVM_tuned")
The tuning process took almost 18 minutes on our mediocre computer. The best hyperparameters from cross validation are as follows:
SVM_tuned = load_data("SVM_tuned")
best_params_SVM = show_best_params(SVM_tuned)
best_params_SVM
plot_hyperparameters_impact(SVM_tuned,
'SVM tuning\n Mean test fold AUC (sorted) with 90% CI')
By default the SVM does not output probabilities. In scikit-learn we can enable them, although the documentation states that this slows down the fitting, which is why we did not enable them during hyperparameter tuning. For our best model, however, we do want probabilities, as they are necessary to test various classification thresholds. We will therefore reuse our tuning function but supply only one set of hyperparameters (the best one from the tested grid), this time with probability calculation enabled
best_params_prob = { k.replace(':', ''): [v] for k, v in best_params_SVM.items() }
best_params_prob['probability'] = [True]
best_params_prob['random_state'] = [3103]
SVM_best_prob = hyperparameter_tuning(X_train, y_train, SVC(), best_params_prob,
folds = 2, repeats = 1, scale = True, n_jobs = 2)
save_data(SVM_best_prob, "SVM_best_prob")
For tree-based methods prior scaling of the features is not necessary, so we will skip that step this time
# Random forest hyperparameters we want to test
hyperparams = {
# number of trees in the forest
"n_estimators": [100,300,500],
# number of features to consider when looking for the best split
"max_features": ['sqrt',0.5,0.7],
# minimum number of samples required to be at a leaf node
"min_samples_leaf" : [2,10,30],
# maximum depth of each tree
"max_depth": [None,3,6],
# The balanced mode uses the values of y to automatically adjust weights inversely proportional to class frequencies
"class_weight": [None,'balanced','balanced_subsample'],
"random_state": [3103]
}
random_forest_tuned = hyperparameter_tuning(X_train, y_train, RandomForestClassifier(), hyperparams,
folds = 5, repeats = 1, scale = False, n_jobs = 2)
save_data(random_forest_tuned, "random_forest_tuned")
The tuning process took about 24 minutes on our mediocre computer. The best hyperparameters from cross validation are as follows:
random_forest_tuned = load_data("random_forest_tuned")
best_params_rf = show_best_params(random_forest_tuned)
best_params_rf
plot_hyperparameters_impact(random_forest_tuned,
'Random forest tuning\n Mean test fold AUC (sorted) with 90% CI')
XGBoost, as a tree-based method, also does not require scaling of features prior to fitting, so we will skip that step. XGBoost is one of the most effective classifiers and has many interesting hyperparameters to specify. Consequently, it is also challenging to search for the best hyperparameters efficiently (in terms of computational time). We defined a relatively small dictionary with up to 4 values to try for each selected hyperparameter. As the calculation below shows, the number of unique combinations grows exponentially with each added hyperparameter and can quickly become overwhelmingly high
# XGBoost hyperparameters we want to test
hyperparams = {
# number of boosting rounds = number of trees
"n_estimators": [100,300,500,1000],
# max depth of individual trees
"max_depth": [None,3,8],
# learning rates (eta)
"learning_rate": [0.001,0.01,0.1,0.3],
# minimum loss reduction required to make a further partition on a leaf
"gamma": [0,1,5,10],
# minimum sum of instance weight(hessian) needed in a child
"min_child_weight": [0,1,5],
# what share of training data use for each tree
"subsample": [0.7,1],
# what share of features use for each tree
"colsample_bytree": [0.7,1],
# L1 regularization of weights
"reg_alpha": [0, 1, 5],
# L2 regularization of weights
"reg_lambda": [1, 2, 5],
# obs. weights - 1 by default, in case of class imbalance the following is recommended :
# (num. obs. from more frequent class) / (num. obs. from less frequent class) ,
"scale_pos_weight": [1, sum(1-y_train) / sum(y_train) ],
"random_state": [3103],
"use_label_encoder": [False],
# starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic'
# was changed from 'error' to 'logloss'. We set it directly to prevent the warning
"eval_metric": ["logloss"]
}
print("size of the grid (number of all combinations):", len(ParameterGrid(hyperparams)))
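For reference, the number reported by ParameterGrid is just the Cartesian product of the candidate lists, which is why every added hyperparameter multiplies the total. A sketch with a hypothetical toy grid:

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# hypothetical toy grid, for illustration only
toy_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 3, 8],
    "learning_rate": [0.01, 0.1],
}

# ParameterGrid enumerates the Cartesian product of all candidate values ...
n_combinations = len(ParameterGrid(toy_grid))
# ... which equals the product of the per-parameter list lengths
assert n_combinations == int(np.prod([len(v) for v in toy_grid.values()]))
print(n_combinations)  # 3 * 3 * 2 = 18
```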
For this reason, for XGBoost we will use random search, which draws a specified number of hyperparameter sets from our grid
## time testing - 105 seconds for 10 iterations of RandomSearch
## it means ~ 175 minutes for 1000 iterations (roughly 3 hours)
# XGBoost_tuned = hyperparameter_tuning(X_train, y_train, xgb.XGBClassifier(), hyperparams,
# folds = 5, repeats = 1, scale = False, n_jobs = 2, n_iter = 10)
XGBoost_tuned = hyperparameter_tuning(X_train, y_train, xgb.XGBClassifier(), hyperparams,
folds = 5, repeats = 1, scale = False, n_jobs = 2, n_iter = 4000)
save_data(XGBoost_tuned, "XGBoost_tuned")
The tuning process took about 9.5 hours on our mediocre computer. The best hyperparameters from cross validation are as follows:
XGBoost_tuned = load_data("XGBoost_tuned")
best_params_xgboost = show_best_params(XGBoost_tuned)
best_params_xgboost
plot_hyperparameters_impact(XGBoost_tuned,
'XGBoost tuning\n Mean test fold AUC (sorted) with 90% CI')
We tested 4000 hyperparameter sets, less than 10% of our defined grid, which itself is only a tiny part of the whole hyperparameter space for XGBoost. If we were to continue the search for hyperparameters allowing even better results, we could analyse the information we have already gathered. For example, we can compare the distributions of cross-validation test scores for selected pairs of tested hyperparameters. Several examples can be seen in the figures below. Such analysis often reveals the regions of the hyperparameter space worth exploring.
def plot_params_comparison(tuner, p1, p2):
'''Plots distribution of the AUC test scores from cross-validation (boxplots) separately for each combination
of hyperparameters p1 and p2
Args:
tuner, GridSearchCV / RandomizedSearchCV object: fitted SearchCV object with AUC scoring
p1, string: name of first hyperparameter (X axis)
p2, string: name of second hyperparameter (hue)
Returns:
Nothing
'''
data = pd.DataFrame(tuner.cv_results_)
# formatting scale_pos_weight for better looking legend on plots
data.param_model__scale_pos_weight = ['{:.2f}'.format(x) for x in data.param_model__scale_pos_weight]
# get the names of columns with test scores (5-fold cv -> 5 columns, 10-fold cv -> 10 columns, etc.)
test_scores = [col for col in data if (col.startswith('split') & col.endswith('_test_score'))]
# transform the data into long format
# note1: we are using individual scores from each test fold, not the mean test score used to select best parameters
# note2: this code is specific for tuners obtained by our function, where we used Pipeline and named modelling step
# 'model'. That is why we need to add not only 'param_' but also 'model__' to the name of parameter
melted_data = data.melt(id_vars = ['param_model__'+p1, 'param_model__'+p2], value_vars = test_scores).fillna(value='None')
melted_data.columns = [p1,p2,'var','test AUC']
sns.catplot(data = melted_data, kind = 'box', y = 'test AUC', x = p1, hue = p2, legend_out = True,
height = 7, aspect = 12/7, palette = 'Blues')
plt.title('boxplot: cross validation test AUC distribution - ' + p1 +' vs ' + p2)
plt.show()
We can clearly see which values of learning_rate and n_estimators work best for our data (although it is interesting that the box for the combination that won our cross validation, learning_rate = 0.1 and n_estimators = 1000, is not the best overall)
plot_params_comparison(XGBoost_tuned, "learning_rate", "n_estimators")
The scale_pos_weight value matching our class proportions, if anything, hindered performance. Higher gamma values work better on our data:
plot_params_comparison(XGBoost_tuned, "gamma", "scale_pos_weight")
The regularization parameters had no visible impact on performance (possibly because we have only 10 features; typically, the more features we use, the higher the risk of picking up noise and overfitting, which is when regularization is needed most)
plot_params_comparison(XGBoost_tuned, "reg_alpha", "reg_lambda")
We can see that the impact of min_child_weight is marginal, while judging by the spread of results it seems better to leave max_depth without a limit
plot_params_comparison(XGBoost_tuned, "min_child_weight", "max_depth")
From the figure below it seems that subsample = 1 and colsample_bytree = 0.7 have a slight edge in observed model performance (although for our cross-validation winner both equaled 0.7)
plot_params_comparison(XGBoost_tuned, "subsample", "colsample_bytree")
The figures above show that, for our data, if anything is worth exploring further it is the learning_rate and n_estimators parameters. We will define a new small grid with additional values of these parameters around those currently selected as best. The other parameters will be set to the optimal values from our initial search
# additional XGBoost hyperparameters we want to test
hyperparams = {
# number of boosting rounds = number of trees
"n_estimators": [800,1000,1200],
# max depth of individual trees
"max_depth": [None],
# learning rates (eta)
"learning_rate": [0.005,0.008,0.01,0.012,0.015,0.02, 0.05,0.08,0.1,0.12,0.15,0.2],
# minimum loss reduction required to make a further partition on a leaf
"gamma": [10],
# minimum sum of instance weight(hessian) needed in a child
"min_child_weight": [1],
# what share of training data use for each tree
"subsample": [0.7],
# what share of features use for each tree
"colsample_bytree": [0.7],
# L1 regularization of weights
"reg_alpha": [0],
# L2 regularization of weights
"reg_lambda": [2],
# obs. weights - 1 by default, in case of class imbalance the following is recommended :
# (num. obs. from more frequent class) / (num. obs. from less frequent class) ,
"scale_pos_weight": [1],
"random_state": [3103],
"use_label_encoder": [False],
# starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic'
# was changed from 'error' to 'logloss'. We set it directly to prevent the warning
"eval_metric": ["logloss"]
}
print("size of the additional grid (number of all combinations):", len(ParameterGrid(hyperparams)))
As the number of combinations is small, this time we will use grid search and evaluate all of them
XGBoost_grid2_tuned = hyperparameter_tuning(X_train, y_train, xgb.XGBClassifier(), hyperparams,
folds = 5, repeats = 1, scale = False, n_jobs = 2)
save_data(XGBoost_grid2_tuned, "XGBoost_grid2_tuned")
The tuning process took about 11 minutes on our mediocre computer. The best hyperparameters from cross validation are as follows:
best_params_xgboost = show_best_params(XGBoost_grid2_tuned)
best_params_xgboost
plot_hyperparameters_impact(XGBoost_grid2_tuned,
'XGBoost tuning (grid 2)\n Mean test fold AUC (sorted) with 90% CI')
As can be seen, we obtained a marginally better result by adding 200 trees to our model, while the best learning_rate remained the same. With more time and computational power we could probably get even better results; for the purpose of this project, however, we consider these sufficient. The winner from grid 2 will advance to the next step of our project
In this section we will compare our models (the best of each type, obtained through hyperparameter tuning with cross validation on the training set). Comparisons will be done on the test set, which until now was left completely aside; this will be the first time our models "see" these data
# loading the saved, prepared data for modelling and evaluation
[X_train, X_test, y_train, y_test] = load_data("train_test_split_list")
# loading the saved best models from each type
LR = load_data("logistic_tuned").best_estimator_
SVM = load_data("SVM_best_prob")
RF = load_data("random_forest_tuned").best_estimator_
XGB = load_data("XGBoost_grid2_tuned").best_estimator_
We will start by looking at the ROC curves and calculating the AUC for these models on the test set. The results are presented in the figures below
def compare_ROC_curve(models, models_names, X, y, FPR_range = (0,1)):
'''Function plotting ROC curves for provided list of models based on supplied data and targets.
Additionally AUC is calculated and presented in the legend
Args:
models, list: list of models (from class with method predict_proba)
models_names, list of strings: list of names to show on the legend
X, dataframe: dataframe with features used by supplied models
y, series with 0/1 values: list with values of target variable (for corresponding rows of X)
FPR_range, tuple: range of FPR values we want to see (it allows to zoom the plot on interesting range of FPR)
Returns:
Nothing
'''
# initialize empty dataframe
plot_data = pd.DataFrame()
# loop through models list
for i, m in enumerate(models):
# predict and gather probabilities for class 1
y_pred_proba = m.predict_proba(X)[:,1]
# calculate points for roc curve
fpr, tpr, thresholds = metrics.roc_curve(y, y_pred_proba)
# calculate AUC
auc = metrics.roc_auc_score(y, y_pred_proba)
# add the results to data for plot (long format to hue by model)
plot_data = pd.concat([plot_data,
pd.DataFrame({"FPR" : fpr, "TPR" : tpr}).assign(
model = models_names[i] + ' (AUC = {:.4f})'.format(auc)
)])
fig, ax = plt.subplots(figsize=(13,9))
sns.lineplot(x = 'FPR', y = 'TPR', hue = 'model', style = 'model', data = plot_data, ax = ax)
plt.plot([0, 1], [0, 1], "k--")
ax.set_title('ROC curves for compared models')
ax.set_xlabel('False positive rate')
ax.set_ylabel('True positive rate')
ax.set(xlim=FPR_range)
plt.yticks(np.linspace(0,1,11))
plt.grid(linestyle='dotted', linewidth=1)
plt.show()
compare_ROC_curve([LR, SVM, RF, XGB], ["LR", "SVM", "RF", "XGB"], X_test,y_test)
compare_ROC_curve([LR, SVM, RF, XGB], ["LR", "SVM", "RF", "XGB"], X_test,y_test, (0,0.3))
There are several observations to make here:
As we hinted when noting the class imbalance, we want to focus on the positive class, i.e. churners. Instead of looking at the false positive rate, whose denominator is the total number of negatives, we may want to inspect precision, the share of true positives among all predicted positives. The curve obtained after such a change on the X axis is called the precision-recall curve (recall being another name for the true positive rate). Additionally we will calculate 2 measures connected with precision and recall:
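A quick numeric illustration of these definitions (synthetic labels, not our models' predictions):

```python
import numpy as np
from sklearn import metrics

# synthetic example: 10 observations, 4 actual positives
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])  # TP=2, FN=2, FP=1, TN=5

precision = metrics.precision_score(y_true, y_pred)  # TP / (TP + FP) = 2/3
recall = metrics.recall_score(y_true, y_pred)        # TP / (TP + FN) = 2/4
print(precision, recall)
```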
def compare_PR_curve(models, models_names, X, y, P_range = (0,1)):
'''Function plotting precision-recall curves for provided list of models based on supplied data and targets.
Additionally max(F1_score) and average precision is calculated
Args:
models, list: list of models (from class with method predict_proba)
models_names, list of strings: list of names to show on the legend
X, dataframe: dataframe with features used by supplied models
y, series with 0/1 values: list with values of target variable (for corresponding rows of X)
P_range, tuple: range of precision values we want to see
(it allows to zoom the plot on interesting range of precision)
Returns:
Nothing
'''
# initialize empty dataframe
plot_data = pd.DataFrame()
# loop through models list
for i, m in enumerate(models):
# predict and gather probabilities for class 1
y_pred_proba = m.predict_proba(X)[:,1]
# calculate points for roc curve
precision, recall, thresholds = metrics.precision_recall_curve(y, y_pred_proba)
# calculate max f1 score
f1 = max(2*precision*recall/(precision + recall))
# calculate average precision
ap = metrics.average_precision_score(y, y_pred_proba)
# add the results to data for plot (long format to hue by model)
plot_data = pd.concat([plot_data,
pd.DataFrame({"precision" : precision, "recall" : recall}).assign(
model = models_names[i] + ' (max(F1) = {:.2%}, AP = {:.2%})'.format(f1, ap)
)])
fig, ax = plt.subplots(figsize=(13,9))
sns.lineplot(x = 'precision', y = 'recall', hue = 'model', style = 'model', data = plot_data, ax = ax)
ax.set_title('Precision-recall curves for compared models')
ax.set_xlabel('Precision (% true positives in all predicted positive)')
ax.set_ylabel('Recall (% true positives in all actually positive)')
ax.set(xlim=P_range)
plt.yticks(np.linspace(0,1,11))
plt.grid(linestyle='dotted', linewidth=1)
plt.show()
compare_PR_curve([LR, SVM, RF, XGB], ["LR", "SVM", "RF", "XGB"], X_test,y_test)
In the figure above we see the same ranking of our models as with AUC and ROC (in general LR < SVM < RF < XGB in terms of predictive power). This supports the argument that AUC is not a wrong choice of cross-validation metric even under class imbalance. One could of course argue it would be better to use average precision for cross validation, but since the ranking of our models remains the same, we do not expect considerably better models that way (at least not on this dataset)
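For reference, swapping the cross-validation metric would only mean changing the scoring argument (scoring='roc_auc' vs scoring='average_precision'); both underlying metrics are available in sklearn.metrics. A sketch on synthetic scores (not our models):

```python
import numpy as np
from sklearn import metrics

# synthetic imbalanced target and model scores, for illustration only
rng = np.random.RandomState(3103)
y = (rng.rand(1000) < 0.2).astype(int)                     # ~20% positives
scores = y * rng.rand(1000) * 0.6 + rng.rand(1000) * 0.5   # noisy scores, higher for positives

auc = metrics.roc_auc_score(y, scores)
ap = metrics.average_precision_score(y, scores)
print(auc, ap)
```

Average precision of an informative model exceeds the positive-class prevalence, which is the baseline a random scorer would achieve.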
After implementing a classification model we always have to decide how it will make decisions. The default classes predicted by models are based on a 50% cut-off, i.e. if the probability of class 1 is higher than 50% the model predicts class 1. But we are not forced to use this default threshold given that we can calculate the probabilities. This is especially crucial for data with class imbalance, as in this project: probabilities higher than 50% may simply be too rare in our data, and the model would then predict class 1 too rarely for our needs.
The important factor here is our costs: does treating an observation from the positive class as negative cost us as much as falsely labelling a negative observation as positive (and acting on it)? In churn problems the cost of acquiring a new customer is usually higher than the cost of keeping a current one (often a matter of a phone call and identifying the customer's problems and needs). This may especially be the case for smaller portfolios of wealthy customers. In such cases we want high recall (capturing as many positive observations, here churners, as possible) and can accept some share of false positives, i.e. lower precision (we can afford to call some customers who would not churn even without our action).
One way of selecting the cut-off would be, for example, stating that we can handle a false positive rate of up to 15% and reading from the ROC curve what true positive rate we can then get (this would require a cost assessment for our particular problem). We will test and present 3 ways of calculating an "optimal" cut-off:
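The FPR-budget approach mentioned above can be read directly from metrics.roc_curve: pick the smallest threshold whose false positive rate still fits the budget. A sketch on synthetic predictions (the 15% budget is just the example value from the text):

```python
import numpy as np
from sklearn import metrics

# synthetic target and churn probabilities, for illustration only
rng = np.random.RandomState(3103)
y = (rng.rand(500) < 0.2).astype(int)
proba = np.clip(0.3 * y + rng.rand(500) * 0.6, 0, 1)

fpr, tpr, thresholds = metrics.roc_curve(y, proba)

# fpr is non-decreasing, so the last index within budget gives the
# smallest threshold (highest recall) that keeps FPR <= 15%
idx = np.where(fpr <= 0.15)[0][-1]
print("cut-off: {:.3f}, FPR: {:.3f}, TPR: {:.3f}".format(thresholds[idx], fpr[idx], tpr[idx]))
```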
We will compare these 3 approaches on the test set (we assume that cut-off value assesment performed on the test set will yield similar results as we will observe on new data after model's implementation)
def MCC(TP,TN,FP,FN):
'''Calculates the Matthews correlation coefficient
Args:
TP, array / series: True positive counts
TN, array / series: True negative counts
FP, array / series: False positive counts
FN, array / series: False negative counts
Returns:
array / series: calculated MCC values
'''
return ((TP*TN)-(FP*FN))/(np.sqrt((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN)))
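As a sanity check on a hypothetical confusion matrix (TP = 50, TN = 800, FP = 100, FN = 50), this helper agrees with sklearn's implementation (MCC is redefined here so the snippet is self-contained):

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

def MCC(TP, TN, FP, FN):
    return ((TP*TN)-(FP*FN))/(np.sqrt((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN)))

# rebuild label vectors matching the confusion matrix above
y_true = np.array([1]*50 + [0]*800 + [0]*100 + [1]*50)
y_pred = np.array([1]*50 + [0]*800 + [1]*100 + [0]*50)

val = MCC(50, 800, 100, 50)
assert np.isclose(val, matthews_corrcoef(y_true, y_pred))
print(val)
```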
def cutoff_assesment(models, models_names, thresholds, X, y, beta):
'''Produce classification diagnostics for given models, probability thresholds, features data and target
Args:
models, list: list of models (from class with method predict_proba)
models_names, list of strings: list of names to add to the dataframe
thresholds, array: list of probability thresholds above which we will predict positive class
X, dataframe: dataframe with features used by supplied models
y, series with 0/1 values: list with values of target variable (for corresponding rows of X)
beta, float: one of calculated measures is Fbeta which uses this parameter. For values > 1,
Fbeta will treat recall as more important than precision
Returns:
results, dataframe: dataframe with row for each model and threshold with the following
classification diagnostics: TP, FN, FP, TN, precision, recall, accuracy, MCC, F1, Fbeta.
There are 3 additional columns indicating the rows with maximal MCC, F1 and Fbeta for each model
'''
# initialize empty dataframe for results
results = pd.DataFrame()
for i, m in enumerate(models):
# predict and gather probabilities for class 1
y_pred_proba = m.predict_proba(X)[:,1]
# dataframe with column for each threshold and row for each observation
# with True if predicted probability > threshold and False otherwise
preds_matrix = pd.DataFrame(np.array([(y_pred_proba > x) for x in thresholds]), index = thresholds).T
# combine results for each model
results = pd.concat([results,
# dataframe with row for each threshold and classification diagnostics in columns
pd.concat(
[pd.DataFrame({"model" : models_names[i],
"cut_off" : x,
"TP" : [sum(np.array(preds_matrix[x] == True)*np.array(y == 1))],
"FN" : [sum(np.array(preds_matrix[x] == False)*np.array(y == 1))],
"FP" : [sum(np.array(preds_matrix[x] == True)*np.array(y == 0))],
"TN" : [sum(np.array(preds_matrix[x] == False)*np.array(y == 0))]})
for x in preds_matrix]).reset_index(drop = True).assign(
precision = lambda x: x.TP / (x.TP + x.FP),
recall = lambda x: x.TP / (x.TP + x.FN),
accuracy = lambda x: (x.TP + x.TN)/(x.TP+x.FN+x.FP+x.TN),
MCC = lambda x: MCC(x.TP,x.TN,x.FP,x.FN),
F1 = lambda x: 2*(x.precision*x.recall)/(x.precision+x.recall),
Fbeta = lambda x: (1+beta**2)*(x.precision*x.recall)/((beta**2)*x.precision+x.recall),
max_MCC = lambda x: x.MCC == np.max(x.MCC),
max_F1 = lambda x: x.F1 == np.max(x.F1),
max_Fbeta = lambda x: x.Fbeta == np.max(x.Fbeta)
)
])
return results
def plot_best_cutoff(cutoff_assesment_df, method = "max_MCC"):
'''Produce bar plots of classification diagnostics based on data from cutoff_assesment function.
Diagnostics are based on the best (in terms of selected method) probability threshold for each model
Args:
cutoff_assesment_df, dataframe: dataframe produced by cutoff_assesment function
method, string: Has to be one of "max_MCC", "max_F1" or "max_Fbeta"
Returns:
Nothing
'''
plots_data = cutoff_assesment_df.query(method + " == True"). \
drop_duplicates(subset = [n for n in list(cutoff_assesment_df.columns) if (n != 'cut_off')])
fig, ax = plt.subplots(nrows = 4, ncols = 2, figsize = (15, 12), sharex = True)
for i, y in enumerate([method.replace("max_",""), "accuracy", "precision", "recall", "TP", "FN", "FP", "TN"]):
a = ax[i // 2][i % 2]
sns.barplot(x = "model", y = y, data = plots_data, ax = a)
a.set_title(y)
a.set_xlabel("")
a.set_ylabel("")
# adding labels for bars
for p in a.patches:
if (i < 4):
a.annotate('{:.2%}'.format(p.get_height()), (p.get_x()+0.1, 0.05), fontsize = 15)
else:
a.annotate(int(p.get_height()), (p.get_x()+0.2, 40), fontsize = 15)
plt.suptitle(method + ": obtained max value + other classification diagnostics")
plt.show()
cutoff_data = cutoff_assesment([LR, SVM, RF, XGB], ["LR", "SVM", "RF", "XGB"],
# 1000 possible cut-off values
np.linspace(0,1,1000),
X_test, y_test, beta = 2)
The figure below shows the results for the cut-off based on maximising MCC. XGBoost achieved the highest MCC and recall, while SVM scored the best precision and an accuracy marginally higher than XGBoost (which was second best in both). Given that we mostly cared about the highest MCC, XGBoost is our winner here.
plot_best_cutoff(cutoff_data, "max_MCC")
Selecting cut-offs maximising F1 we obtained interesting differences between the models:
plot_best_cutoff(cutoff_data, "max_F1")
Fbeta with beta = 2 is not as balanced as F1 and MCC, but that is intentional: we want to emphasise recall (it is easier and cheaper to retain a customer than to acquire a new one). Although XGBoost scored the lowest recall here, it allowed the highest precision and the highest overall Fbeta (closely followed by random forest), so it again wins for the measure we wanted to maximise. Going back to our precision-recall plot, if we were to state the minimal precision we can afford (in terms of the cost of handling false positives), XGBoost would most often give us the highest recall (for some precision values random forest is better). We also see the highest accuracy for XGBoost in this figure, although it should be remembered that if we cared only about accuracy, even predicting "no churn" for all customers would score higher (as almost 80% of customers did not churn). If we wanted to maximise accuracy, XGBoost and random forest could both exceed 86% (with random forest slightly ahead)
plot_best_cutoff(cutoff_data, "max_Fbeta")
# verifying maximal accuracy of our models
cutoff_data.groupby('model', as_index = False).agg(max_accuracy = ('accuracy',max)).sort_values('max_accuracy')
In the previous subsections we showed that, out of the 4 models chosen as the best of their kind after hyperparameter tuning with cross validation, our XGBoost model has the highest overall performance. As discussed, the cut-off value above which we predict the positive class (here churn) can be selected in many ways; it is all a matter of stating what matters most to us and analysing the costs of false positives vs false negatives. Because in churn problems high recall is often what we need, we will choose the tested cut-off maximising Fbeta with beta = 2.
# cut-offs we calculated - we select one with max_Fbeta but will show other tested for reference
cutoff_data.query("model == 'XGB' & (max_Fbeta == True | max_MCC == True | max_F1 == True)"). \
drop_duplicates(subset = ['TP', 'TN'])
We will modify the predict method of our XGBoost model to implement the chosen cut-off. Below we can see the results, which agree with the summary above for prediction with the custom cut-off. We can see that:
# we are doing a deepcopy to change only final_model's predict method
# we additionally get the model out of the Pipeline object so it will be the XGBClassifier object
# (as for this type of model we did not preprocess the data as part of the Pipeline and modelling step was the only one)
final_model = deepcopy(XGB.named_steps.model)
# custom predict method with cut_off we selected
def predict(self, X):
preds = self.predict_proba(X)[:,1]
return (preds > 0.162162)*1
# this is how we can override class method for existing object
final_model.predict = types.MethodType(predict, final_model)
th = ["default", "custom"]
for t in th:
print("classification with", t, "predict:")
m = XGB if t == "default" else final_model
display(pd.DataFrame({"exited": y_test, t+" prediction": m.predict(X_test)}). \
groupby(["exited", t+" prediction"]).size().to_frame("cnt"). \
assign(pr = lambda x: ['{:.2%}'.format(p) for p in x.cnt / sum(x.cnt)],
pr_per_exited = lambda x: ['{:.2%}'.format(p) for p in x.cnt / x.groupby("exited")["cnt"].transform('sum')]))
# saving the final model
save_data(final_model, "final_model")
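The types.MethodType override used above binds the new predict to that single object only; other instances keep the original method. A minimal stand-alone illustration with a toy class:

```python
import types
from copy import deepcopy

class Toy:
    def predict(self):
        return "default"

original = Toy()
patched = deepcopy(original)

def predict(self):
    return "custom"

# override the method on the copied instance only;
# the class and the original instance are untouched
patched.predict = types.MethodType(predict, patched)

print(original.predict(), patched.predict())  # default custom
```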
The last test is a simulation of what we have to do to use our model on raw data. Luckily it is not much. Below we see the code and the results of predicting with our model on the whole dataset
# reading the downloaded file
df = pd.read_csv('./Churn_Modelling.csv')
# getting feature names
features = final_model.get_booster().feature_names
# how to predict on new data. We use add_new_features function from Feature engineering section
pd.Series(final_model.predict(add_new_features(df)[features])).value_counts()
One final thing worth discussing is the interpretability of our model. We know it predicts the probability that a customer will churn, and we selected a cut-off value reflecting our effort to catch as many churners as we can while maintaining an affordable number of false positives. But do we know why the model assigned a particular probability to a given customer? Not yet.
XGBoost is a so-called black-box method, which means it is hard to show directly how it derived a given result from the data. Our particular model consists of 1200 individual decision trees and is impossible to present in an understandable manner. But we want that! In an attrition problem we would probably use the model to identify potential churners, then contact them and take some action. It would be useful for the person contacting the customer to know why the model thinks this customer will end the relationship with our company; it may be vital in defining actions that could change the client's mind. We could have a 100% accurate model, but without actions to counter churn we might as well not bother developing a model at all.
There are methods to understand what drives the model's decisions (even for models as complicated as XGBoost with so many trees). For example, we can analyse Shapley values, a concept originating in game theory and implemented in the shap library. There are other techniques implemented in libraries like Dalex and Lime, but we will focus only on shap.
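To build intuition for what shap computes, here is a small self-contained sketch of exact Shapley values for a toy "model": each feature's Shapley value is its marginal contribution averaged over all coalitions of the remaining features. The value function v below is purely illustrative and has nothing to do with our fitted XGBoost (shap uses a far faster tree-specific algorithm instead of this brute-force enumeration):

```python
from itertools import combinations
from math import factorial

def exact_shapley(value_fn, features):
    """Brute-force Shapley values: for each feature, average its marginal
    contribution value_fn(S + {f}) - value_fn(S) over every coalition S
    of the remaining features, with the classic game-theoretic weights."""
    n = len(features)
    phi = {}
    for f in features:
        rest = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            for coalition in combinations(rest, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (value_fn(set(coalition) | {f}) - value_fn(set(coalition)))
        phi[f] = total
    return phi

# toy value function: the "model output" when only the listed features are known
v = lambda s: 2.0 * ("Age" in s) + 1.0 * ("Balance" in s)
print(exact_shapley(v, ["Age", "Balance"]))  # additive toy model: each feature gets exactly its own effect
```

For an additive value function the Shapley values recover each feature's standalone effect; for real models with interactions the averaging over coalitions is what splits the interaction effects fairly between features.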
Let's start with importing our test set and our final model
# loading the saved, prepared data for modelling and evaluation
[X_train, X_test, y_train, y_test] = load_data("train_test_split_list")
final_model = load_data("final_model")
# unfortunately we verified that default saving with pickle.dump did not preserve our override of the predict method
# while this is definitely worth looking into, we will not do it as part of this project and will simply
# repeat the code overriding the predict method to use our selected cut-off
def predict(self, X):
    preds = self.predict_proba(X)[:,1]
    return (preds > 0.162162)*1
final_model.predict = types.MethodType(predict, final_model)
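Why the monkey-patched predict gets lost is worth a small illustration. The sketch below uses a hypothetical stand-in class, not XGBClassifier: pickle serialises a bound method by reference (essentially getattr(obj, "predict")), so on load the name typically resolves back to the class's own method. Keeping the cut-off as plain data on a subclass avoids the problem, since only attributes are serialised:

```python
import pickle
import types

class Model:
    """Hypothetical stand-in for the fitted classifier."""
    def predict(self):
        return "default 0.5 cut-off"

m = Model()

def predict(self):
    return "custom 0.162162 cut-off"

# monkey-patching works in memory ...
m.predict = types.MethodType(predict, m)
print(m.predict())

# ... but typically does not survive a pickle round trip
restored = pickle.loads(pickle.dumps(m))
print(restored.predict())

# a more robust pattern: the threshold is plain data, predict lives on the class
class ThresholdModel(Model):
    def __init__(self, threshold):
        self.threshold = threshold
    def predict(self):
        return f"custom {self.threshold} cut-off"

restored2 = pickle.loads(pickle.dumps(ThresholdModel(0.162162)))
print(restored2.predict())
```

In this project the same idea would mean subclassing XGBClassifier with the 0.162162 threshold as an attribute instead of patching the instance.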
We will use TreeExplainer since our model is tree-based, but options are also available for other types of models. Then, based on the explainer object, we can calculate Shapley values for our test set
# our final model here is an XGBClassifier, which TreeExplainer accepts directly
# (if the model were wrapped in a Pipeline, we would have to pass the modelling step instead)
explainer = shap.TreeExplainer(final_model)
shap_values = explainer.shap_values(X_test)
explained_vals = explainer(X_test)
We will start with one summary plot which contains a huge amount of useful information. This is how to interpret it:
shap.summary_plot(shap_values, X_test, plot_size = (15,10))
Feature importance can be assessed in various ways; for XGBoost, for example, directly with the function plot_importance from the xgboost library. We can use various types of importance - below we chose gain, which is based on the average gain of the splits that use a given feature. We can see that the order is similar but not identical - especially when we look at NumOfProductsMany. This feature rarely has values other than 0, but when it does, it provides great differentiation between customers (hence the average gain of the relatively few splits that use it is high)
xgb.plot_importance(final_model, importance_type = 'gain', show_values = False, height = 0.8)
plt.show()
Using weight, which counts the number of times a feature appears in the trees, visibly changes the picture for some features (like NumOfProductsMany and Balance)
xgb.plot_importance(final_model, importance_type = 'weight', show_values = False, height = 0.8)
plt.show()
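The difference between the two importance types is only in how per-split statistics are aggregated. A small sketch on a hypothetical, made-up list of (feature, gain) split records, mirroring the idea behind xgboost's importance_type options rather than its actual internals:

```python
# hypothetical split log: (feature used at a split, gain achieved by that split)
splits = [
    ("Age", 120.0), ("Age", 80.0), ("Age", 40.0), ("Age", 20.0),
    ("Balance", 15.0), ("Balance", 10.0),
    ("NumOfProductsMany", 300.0),   # a rare but very informative split
]

def importance(splits, importance_type="gain"):
    """'weight': how many splits use each feature; 'gain': average gain per split."""
    weight, gain_sum = {}, {}
    for feat, gain in splits:
        weight[feat] = weight.get(feat, 0) + 1
        gain_sum[feat] = gain_sum.get(feat, 0.0) + gain
    if importance_type == "weight":
        return weight
    return {f: gain_sum[f] / weight[f] for f in weight}

print(importance(splits, "weight"))  # Age dominates: it is split on most often
print(importance(splits, "gain"))    # NumOfProductsMany dominates: few but powerful splits
```

This is exactly the pattern seen in the plots above: a feature split on rarely but with large gains ranks high under gain and low under weight.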
Going back to shap - we can also look at the impact of each individual feature in more detail. Below we present dependence plots from the shap library. On these scatter plots, for each observation we see:
# we will not make this code into a function - we use it one time and subplots' nrow and ncols are designed
# for our dataset with 10 features
fig, ax = plt.subplots(nrows = 5, ncols = 2, figsize = (15, 20), sharex = False)
f = X_test.columns
for i in range(10):
    a = ax[i // 2][i % 2]  # row-major position in the 5x2 grid
    shap.dependence_plot(f[i], shap_values, X_test, ax = a, show = False)
plt.show()
We can also play with the interactive force plot (see the HTML version of this notebook to experience the interactivity). While we will not explain this plot in detail, we can get potential insights by selecting the same variable on both axes
shap.initjs()
display(shap.force_plot(explainer.expected_value, shap_values, X_test))
One final look inside our model, potentially the most useful one in terms of choosing actions to prevent a customer from churning, is to analyse the results obtained for particular customers. Let us select various customers and inspect the drivers of their predicted probability of churn
X_test_plus_prediction = X_test.reset_index(drop = True).assign(
    pred = final_model.predict(X_test),
    prob = final_model.predict_proba(X_test)[:,1],
    cust_no = lambda x: x.index)
# during this "sanity check" we noticed that pickle.dump did not save our predict method override
# X_test_plus_prediction.groupby('pred')['prob'].agg([min,max,'count'])
# we specify different scenarios to see different profiles of churn probability drivers
customer_1 = X_test_plus_prediction.query("prob > 0.8 & NumOfProductsMany == 1 & FromGermany == 0 & IsFemale == 0").head(1)
customer_2 = X_test_plus_prediction.query("prob > 0.9 & NumOfProductsMany == 0").head(1)
customer_3 = X_test_plus_prediction.query("pred == 0 & NumOfProductsTwo == 0").head(1)
customer_4 = X_test_plus_prediction.query("pred == 0 & NumOfProductsTwo == 1").head(1)
# showing the results in table
cust = pd.concat([customer_1.T, customer_2.T, customer_3.T, customer_4.T], axis = 1)
cust.columns = ["customer 1", "customer 2", "customer 3", "customer 4"]
display(cust.iloc[0:-1,:])
# the following function (along with comments inside) is a direct copy from this interesting article:
# https://medium.com/dataman-in-ai/the-shap-with-more-elegant-charts-bc3e73fa1c0c
# In general it allows to transform the output from XGBoost models to probabilities to have better interpretation
# of shap waterfall charts
def xgb_shap_transform_scale(original_shap_values, Y_pred, which):
    from scipy.special import expit  # the logistic (inverse logit) function, used to transform the base value
    # Compute the transformed base value by applying the logistic function to the raw base value
    untransformed_base_value = original_shap_values.base_values[-1]
    # Computing the original_explanation_distance to construct the distance_coefficient later on
    original_explanation_distance = np.sum(original_shap_values.values, axis=1)[which]
    base_value = expit(untransformed_base_value)  # = 1 / (1 + np.exp(-untransformed_base_value))
    # Computing the distance between the model prediction and the transformed base_value
    distance_to_explain = Y_pred[which] - base_value
    # The distance_coefficient is the ratio between both distances, used to rescale the values
    distance_coefficient = original_explanation_distance / distance_to_explain
    # Transforming the original Shapley values to the new (probability) scale
    shap_values_transformed = original_shap_values / distance_coefficient
    # Finally resetting the base_value and data, which do not need to be transformed
    shap_values_transformed.base_values = base_value
    shap_values_transformed.data = original_shap_values.data
    # Now returning the transformed values
    return shap_values_transformed
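To see the arithmetic on concrete numbers, here is a toy single-observation version of the same rescaling (all numbers are made up). The key property it preserves: after rescaling, the per-feature contributions sum exactly to the gap between the predicted probability and the transformed base value:

```python
from math import exp

def expit(x):
    # the logistic function - the inverse of the logit
    return 1.0 / (1.0 + exp(-x))

# made-up numbers for one observation
base_log_odds = -1.5                     # raw base value from the explainer (log-odds scale)
shap_log_odds = [0.8, -0.2, 0.5]         # per-feature SHAP values on the log-odds scale
prob = expit(base_log_odds + sum(shap_log_odds))  # the model's predicted probability

# the same rescaling as in xgb_shap_transform_scale, written out for one row
base_prob = expit(base_log_odds)
distance_coefficient = sum(shap_log_odds) / (prob - base_prob)
shap_prob = [v / distance_coefficient for v in shap_log_odds]

# the rescaled contributions bridge base_prob and prob exactly
print(round(base_prob + sum(shap_prob), 10), round(prob, 10))
```

Note the rescaling is linear, so the relative ordering and signs of the contributions are unchanged; only their scale becomes interpretable as probability.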
for i, _ in enumerate(cust.columns):
    ind = int(cust.loc["cust_no"][i])
    shap_values_transformed = xgb_shap_transform_scale(explained_vals, X_test_plus_prediction.prob, ind)
    print("Impact of features on model's decision for customer", i+1)
    shap.plots.waterfall(shap_values_transformed[ind])
Not only do we see all the features' values for a given customer, we can also get an idea of how these values contributed to the final probability. For example:
In general, the more interesting and complex features we have, the more useful insights and hints for preventive actions we can gather from such plots. These plots would definitely provide added value if they were implemented alongside our model in the applications used by the people contacting customers (for example their personal advisors).
In the analyses described above we covered most of the steps usually taken during a successful model development project. Typically though: